Mobile app stores are key distributors of mobile applications. They regularly apply vetting processes to the deployed apps. However, some of these vetting processes may be insufficient or applied late. Delayed app removal can have unpleasant consequences for developers and users. Therefore, in this work we propose a data-driven predictive approach that determines whether a given app will be removed or accepted. It also indicates the relevance of the features, which can help stakeholders interpret the outcome. In turn, our approach can support developers in improving their apps and users in downloading apps that are less likely to be removed. We focus on the Google App Store and compile a new dataset of 870,515 apps, 56% of which were actually removed from the market. Our proposed approach is a bootstrap aggregation of multiple XGBoost machine learning classifiers. We propose two models: a user-centered one using 47 features, and a developer-centered one using 37 features that are available only before deployment. We achieve the following areas under the ROC curve (AUC) on the test set: user-centered = 0.792, developer-centered = 0.762.
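A minimal sketch of the core modelling idea, bootstrap aggregation of XGBoost classifiers, is given below; the synthetic data, the 37-column pre-deployment feature matrix, and all hyperparameters are illustrative assumptions rather than the authors' actual configuration.

```python
# Sketch: bagging of several XGBoost classifiers and AUC evaluation.
# Data and settings are synthetic placeholders, not the paper's setup.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 37))          # hypothetical pre-deployment features
y = rng.integers(0, 2, size=10_000)        # 1 = app later removed, 0 = kept

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = BaggingClassifier(
    XGBClassifier(n_estimators=200, max_depth=6),  # base learner
    n_estimators=10,        # number of bootstrapped XGBoost models
    max_samples=0.8,        # each model sees a bootstrap sample of the data
    n_jobs=-1,
    random_state=0,
)
model.fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```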
Code generation from text requires understanding the user's intent from a natural language description (NLD) and generating an executable program code snippet that satisfies this intent. While recent pretrained language models (PLMs) demonstrate remarkable performance on this task, these models fail when the given NLD is ambiguous because it lacks sufficient specifications for generating a high-quality code snippet. In this work, we introduce a novel and more realistic setup for this task. We hypothesize that ambiguities in the specifications of an NLD are resolved by asking clarification questions (CQs). Therefore, we collect and introduce a new dataset named CodeClarQA containing NLD-Code pairs with created CQAs. We evaluate the performance of PLMs for code generation on our dataset. The empirical results support our hypothesis that clarifications result in more precise generated code, as shown by an improvement of 17.52 in BLEU, 12.72 in CodeBLEU, and 7.7% in exact match. Alongside this, our task and dataset introduce new challenges to the community, including when and what CQs should be asked.
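To make the setup concrete, the following sketch shows how an NLD-Code pair with clarification question-answer pairs might be represented, together with a simple exact-match metric; the field names and example content are assumptions for exposition, not the actual CodeClarQA schema.

```python
# Hypothetical record layout for one NLD-Code pair with clarification QAs.
example = {
    "nld": "Sort the records by date.",
    "clarification_qas": [
        {"question": "Should the sort be ascending or descending?",
         "answer": "Descending, most recent first."},
        {"question": "Which field holds the date?",
         "answer": "The 'created_at' column."},
    ],
    "code": "df = df.sort_values('created_at', ascending=False)",
}

def exact_match(predictions, references):
    """Fraction of generated snippets that match the reference verbatim."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```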
In data-driven systems, data exploration is imperative for making real-time decisions. However, big data is stored in massive databases from which retrieval is difficult. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (a synopsis) that closely replicates the behavior of the actual data. It is useful wherever an approximate answer to a query is acceptable in a fraction of the real execution time. In this paper, we discuss the use of Generative Adversarial Networks (GANs) for generating tabular data that can be employed in AQP for synopsis construction. We first discuss the challenges associated with constructing synopses in relational databases and then introduce solutions to those challenges. Following that, we organize statistical metrics to evaluate the quality of the generated synopses. We conclude that tabular data complexity makes it difficult for algorithms to understand relational database semantics during training, and that improved versions of tabular GANs are capable of constructing synopses that can revolutionize data-driven decision-making systems.
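A minimal sketch of the AQP workflow described above, under the assumption that a tabular GAN such as the open-source `ctgan` package is used: train on the relation, sample a compact synopsis, and answer an aggregate query on the synopsis instead of the full table. The table, its column names, and the training settings are hypothetical.

```python
# Sketch: tabular-GAN synopsis construction for approximate aggregate queries.
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

orders = pd.read_csv("orders.csv")              # hypothetical relation
gan = CTGAN(epochs=50)
gan.fit(orders, discrete_columns=["country"])   # learn the joint distribution

synopsis = gan.sample(10_000)                   # compact generated synopsis

# Approximate answer to: SELECT country, AVG(amount) FROM orders GROUP BY country
approx = synopsis.groupby("country")["amount"].mean()
exact = orders.groupby("country")["amount"].mean()
print((approx - exact).abs())                   # per-group approximation error
```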
This article presents Lisan, morphologically annotated corpora of the Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects. Lisan features around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators who are native speakers of the target dialects carried out the annotations. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes, and labeled each with morphological features such as part of speech, lemma, and an English gloss. An Arabic Dialect Annotation Toolkit (ADAT) was developed for the purpose of the annotation. The annotators were trained on a set of guidelines and on how to use ADAT. We developed ADAT to assist the annotators and to ensure compatibility with the SAMA and Curras tagsets. The tool is open source, and the four corpora are also available online.
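For illustration, a single annotated token might be represented along the following lines; the field names and the example itself are assumptions for exposition and do not reflect the exact Lisan/ADAT output format.

```python
# Hypothetical record for one morphologically annotated, segmented token.
annotated_token = {
    "word": "wallah",                 # surface form (transliterated here)
    "segments": {
        "prefix": "wa",               # conjunction prefix
        "stem": "allah",
        "suffix": "",
    },
    "pos": "CONJ+NOUN_PROP",          # part-of-speech tag
    "lemma": "allah",
    "gloss_en": "and + God",
    "dialect": "Yemeni",
}
```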
Hawkes processes have recently risen to the forefront of tools for modeling and generating sequential event data. Multidimensional Hawkes processes model both the self- and cross-excitation between different types of events and have been applied successfully in various domains such as finance, epidemiology, and personalized recommendations, among others. In this work we present an adaptation of the Frank-Wolfe algorithm for learning multidimensional Hawkes processes. Experimental results show that our approach has better or on-par accuracy in terms of parameter estimation compared to other first-order methods, while enjoying a significantly faster runtime.
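For readers unfamiliar with the method, the following is a generic, projection-free Frank-Wolfe template over the probability simplex; the Hawkes log-likelihood objective, the constraint set, and the step-size rule used in the paper may differ from this toy example.

```python
# Generic Frank-Wolfe over the probability simplex (illustration only).
import numpy as np

def frank_wolfe(grad, x0, n_iters=200):
    """Minimize a smooth convex f over the simplex given its gradient oracle."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # linear minimization oracle: best vertex
        gamma = 2.0 / (t + 2.0)        # standard diminishing step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy usage: minimize ||x - c||^2 over the simplex.
c = np.array([0.1, 0.7, 0.2])
x_star = frank_wolfe(lambda x: 2 * (x - c), np.ones(3) / 3)
```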
Graph neural networks have been shown to learn effective node representations, enabling node-, link-, and graph-level inference. Conventional graph networks assume static relations between nodes, while relations between entities in a video often evolve over time, with nodes entering and exiting dynamically. In such temporally-dynamic graphs, a core problem is inferring the future state of spatio-temporal edges, which can constitute multiple types of relations. To address this problem, we propose MTD-GNN, a graph network for predicting temporally-dynamic edges for multiple types of relations. We propose a factorized spatio-temporal graph attention layer to learn dynamic node representations and present a multi-task edge prediction loss that models multiple relations simultaneously. The proposed architecture operates on top of scene graphs that we obtain from videos through object detection and spatio-temporal linking. Experimental evaluations on ActionGenome and CLEVRER show that modeling multiple relations in our temporally-dynamic graph network can be mutually beneficial, outperforming existing static and spatio-temporal graph neural networks, as well as state-of-the-art predicate classification methods.
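A rough sketch of a multi-task edge prediction head, one classifier per relation type over concatenated node embeddings with the per-relation losses summed, is shown below; the dimensions, head design, and equal loss weighting are assumptions, not the released MTD-GNN implementation.

```python
# Sketch: multi-task edge classification from dynamic node representations.
import torch
import torch.nn as nn

class MultiTaskEdgePredictor(nn.Module):
    def __init__(self, node_dim, classes_per_relation):
        super().__init__()
        # One linear classifier per relation type over the edge representation.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * node_dim, c) for c in classes_per_relation]
        )

    def forward(self, h_src, h_dst):
        edge = torch.cat([h_src, h_dst], dim=-1)      # edge representation
        return [head(edge) for head in self.heads]    # logits per relation type

def multi_task_loss(logits_per_relation, labels_per_relation):
    """Sum of per-relation cross-entropy losses (equal weighting assumed)."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(lg, lb) for lg, lb in zip(logits_per_relation, labels_per_relation))
```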
The Longest Common Subsequence (LCS) problem asks for the longest subsequence that is common to all strings in a given set. The LCS has applications in computational biology and text editing, among many others. Due to the NP-hardness of the general longest common subsequence problem, numerous heuristic algorithms and solvers have been proposed to give the best possible solution for different sets of strings, but none of them performs best on all types of sets. In addition, there is no method to specify the type of a given set of strings. Besides that, the available hyper-heuristic is neither efficient nor fast enough to solve this problem in real-world applications. This paper proposes a novel hyper-heuristic for the longest common subsequence problem that uses a novel criterion to classify a set of strings based on their similarity. To do this, we offer a general stochastic framework to identify the type of a given set of strings. Building on this framework, we introduce the set similarity dichotomizer ($S^2D$) algorithm, which divides sets into two types. This algorithm is introduced for the first time in this paper and opens a new way to go beyond the current LCS solvers. We then present a novel hyper-heuristic that exploits $S^2D$ and one of the internal properties of the set to choose the best-matching heuristic among a set of heuristics. We compare our results on benchmark datasets with those of the best heuristics and hyper-heuristics. The results show the higher performance of our proposed hyper-heuristic in both solution quality and run time.
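The dispatch logic of such a hyper-heuristic can be sketched as follows: score how similar the strings in a set are, dichotomize the set type, and hand the instance to the heuristic that tends to work best for that type. The similarity measure and threshold below are placeholders, not the paper's $S^2D$ criterion.

```python
# Sketch: classify a set of strings by similarity, then dispatch to a heuristic.
from itertools import combinations

def pairwise_char_overlap(a, b):
    """Cheap similarity proxy: Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def set_similarity(strings):
    pairs = list(combinations(strings, 2))
    return sum(pairwise_char_overlap(a, b) for a, b in pairs) / len(pairs)

def hyper_heuristic(strings, heuristic_high, heuristic_low, threshold=0.5):
    """Dichotomize the set and dispatch to the better-matching LCS heuristic."""
    if set_similarity(strings) >= threshold:
        return heuristic_high(strings)
    return heuristic_low(strings)
```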
This paper proposes a generalizable, end-to-end deep learning-based method for relative pose regression between two images. Given two images of the same scene captured from different viewpoints, our algorithm predicts the relative rotation and translation between the two respective cameras. Despite recent progress in the field, current deep learning-based methods exhibit only limited generalization to scenes not seen during training. Our approach introduces a network architecture that extracts a grid of coarse features for each input image using the pre-trained LoFTR network. It subsequently relates corresponding features in the two images, and finally uses a convolutional network to recover the relative rotation and translation between the respective cameras. Our experiments indicate that the proposed architecture can generalize to novel scenes, obtaining higher accuracy than existing deep-learning-based methods in various settings and datasets, in particular with limited training data.
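A minimal sketch of the regression stage is given below: given coarse feature grids for the two images (e.g. from a frozen matcher backbone such as LoFTR), the head relates them and regresses a unit quaternion and a translation. The layer sizes and the simple concatenation used to relate the features are simplifying assumptions, not the paper's exact architecture.

```python
# Sketch: convolutional head regressing relative rotation and translation
# from two coarse feature grids.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPoseHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_rot = nn.Linear(256, 4)    # unit quaternion
        self.fc_trans = nn.Linear(256, 3)  # translation vector

    def forward(self, feats_a, feats_b):   # (B, C, H, W) coarse feature grids
        x = self.conv(torch.cat([feats_a, feats_b], dim=1)).flatten(1)
        q = F.normalize(self.fc_rot(x), dim=-1)
        t = self.fc_trans(x)
        return q, t
```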
Continuous behavioural authentication methods add a unique layer of security by allowing individuals to verify their unique identity when accessing a device. Maintaining session authenticity is now feasible by monitoring users' behaviour while interacting with a mobile or Internet of Things (IoT) device, making credential theft and session hijacking ineffective. Such a technique is made possible by integrating the power of artificial intelligence and Machine Learning (ML). Most of the literature focuses on training machine learning models for a user by transmitting their data to an external server, which exposes private user data to threats. In this paper, we propose a novel Federated Learning (FL) approach that protects the anonymity of users and maintains the security of their data. We present a warm-up approach that provides a significant accuracy increase. In addition, we leverage a transfer learning technique based on feature extraction to boost the models' performance. Our extensive experiments on four datasets, MNIST, FEMNIST, CIFAR-10 and UMDAA-02-FD, show a significant increase in user authentication accuracy while maintaining user privacy and data security.
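The federated setup can be sketched as plain FedAvg: each client trains locally on its own behavioural data and only model weights are averaged on the server, so raw user data never leaves the device. The warm-up and transfer-learning components of the paper are omitted here, and the function names are illustrative.

```python
# Sketch: one FedAvg communication round (illustration only).
import copy
import torch

def local_update(model, loader, epochs=1, lr=0.01):
    """Train a copy of the global model on one client's local data."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def federated_average(client_states):
    """Element-wise average of the clients' weights (the server step)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(0)
    return avg

# One round (client_loaders is a list of per-user DataLoaders):
# states = [local_update(global_model, dl) for dl in client_loaders]
# global_model.load_state_dict(federated_average(states))
```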
Current pre-trained language models rely on large datasets for achieving state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training. In fact, it is sometimes possible to prune a considerable fraction of the training set while maintaining test performance. Two gradient-based scoring metrics for finding important examples, established on standard vision benchmarks, are GraNd and its estimated version, EL2N. In this work, we employ these two metrics for the first time in NLP. We demonstrate that these metrics need to be computed after at least one epoch of fine-tuning and are not reliable in early steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve the test accuracy but also surpass it. This paper details the adjustments and implementation choices that enable GraNd and EL2N to be applied to NLP.
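A sketch of the EL2N score and the pruning step described above: the score is the L2 norm of the error vector (softmax probabilities minus the one-hot label), computed only after at least one epoch of fine-tuning, and the examples with the highest scores are dropped. The helper names and the pruning fraction are illustrative.

```python
# Sketch: EL2N scoring and pruning of the highest-scoring examples.
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, loader, num_classes):
    """EL2N = L2 norm of (softmax probabilities - one-hot label) per example."""
    scores = []
    for x, y in loader:
        probs = F.softmax(model(x), dim=-1)
        onehot = F.one_hot(y, num_classes).float()
        scores.append((probs - onehot).norm(dim=-1))
    return torch.cat(scores)

def prune_highest(dataset_indices, scores, fraction=0.05):
    """Drop the `fraction` of examples with the highest EL2N scores."""
    k = int(len(scores) * fraction)
    keep = torch.argsort(scores)[: len(scores) - k]
    return [dataset_indices[i] for i in keep]
```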